ABSTRACT

We present a system to index and search conversational speech
using a scoring heuristic on the expected posterior counts of
phone n-grams in recognition lattices. We report significant
improvements in retrieval effectiveness on five human languages
over a strong 1-best baseline. The method is shown to improve
the utility (mean average precision) of the retrieved lattices'
rank order, and to do so at a search cost that is negligible
compared to the fastest known methods for the linear scanning
of phonetic lattices.

Index Terms: Information retrieval, Speech processing,
Speech recognition, Natural languages, Natural language interfaces

1.  INTRODUCTION 

Over the past decade, significant progress has been made towards
systems capable of indexing and searching large volumes of spoken
communications [1]. This area of interest is generally referred
to as spoken document retrieval (SDR). An SDR system enables a
user to enter a natural-language or word-based query and to
retrieve spoken documents (files, utterances, etc.) containing
those terms.

Many current systems combine an automatic speech recognition
(ASR) process, which decodes speech into word-based text, with a
text-based information retrieval system. While these systems
perform reasonably well when applied to data with low ASR word
error rates and medium-sized vocabularies (e.g., broadcast news),
alternative methods are often necessary. This is particularly so
in conversational speech, in which irregular prosody and
out-of-vocabulary (OOV) terms are prevalent. For information
seekers, it is precisely because these OOV terms (e.g., names of
people and places) are rare that they are informative, and so
special care must be taken to support their detection. This paper
focuses on detecting these rare terms, that is, on
vocabulary-independent audio search.

Research in vocabulary-independent SDR has largely focused on
subword indexing methods [2] and on efficiently searching more
complex representations of the recognition hypotheses, such as
phonetic lattices [3, 4]. Because a linear search through many
lattices is still costly, two-stage search systems have also been
considered. Two-stage search uses a fast, high-recall,
low-precision filtering system to produce a candidate set of
lattices for further scanning. In [5], discriminating fragments
of phone sequences are indexed to produce an unordered set of
these candidates. Our work is similar in that we use an inverted
index on lattice features (in our case, expected phone n-gram
counts) to retrieve the segments. Our approach differs in its
choice of indexing unit, in its focus on very fast search (we do
not allow for a costly second stage), and in that we produce an
ordered list of lattices. We focus on improving the utility
(i.e., the rank order) of the lattices in one stage.

Our results are also significant in the breadth of languages we
examine. We report experiments on English, Spanish, Mandarin
Chinese, Persian Farsi, and Levantine Arabic conversational
speech. The methods developed are universally applicable, and
have thus far been extended to handle many of the world's
languages. This motivates our emphasis on approaches such as
phonetic lattice indexing, which do not require the impracticable
cost of training large-vocabulary continuous speech recognizers
on resource-poor languages.

2.  LATTICE  GENERATION 

Before we can index or search our spoken documents, we run
phonetic recognition to produce a compact set of hypotheses (a
phonetic lattice) for the phone sequences observed in the audio.
The acoustic models used to produce these lattices are created
using HTK's embedded training functionality. For the experiments
presented in this work, new models are trained for each language.
Prior to training, word-based transcriptions are converted to
phonemes using a rule-based transliterator (RBT) [6]. The HMMs
are then trained as left-context-dependent phones and have three
states, each with 17 Gaussian mixtures.

Phonetic lattices are produced using an alternative Viterbi
implementation called the Token Passing Model [7], contained
within HTK. In this formulation, a token is passed from state to
state; it contains the log probability of the current path as
well as a record of the previous states or models already
visited. At each state, a copy of the present token is passed to
each of the connecting states. When a state is passed multiple
tokens, they are each examined and only the token with the
highest probability is retained. For lattice generation, multiple
tokens can be saved at each state in order to retain multiple
hypotheses. Consequently, as the process is directed to retain
more tokens at each state, a larger number of hypotheses are
recorded, resulting in a deeper lattice. Increasing the number of
tokens, however, incurs a higher computational cost, both during
decoding and, later, indexing.
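
The effect of keeping several tokens per state can be illustrated
with a small Python sketch. It is a simplified, hypothetical
rendering of token passing over frame-synchronous states and not
HTK's implementation: the function name token_passing, the
dictionary-based model layout, and the parameter k (tokens kept
per state) are all assumptions made for illustration.

```python
import heapq


def token_passing(frames, trans, start, k=1):
    """Propagate tokens through states, keeping the k best per state.

    frames: list (one entry per time step) of dict state -> observation
            log-likelihood (hypothetical layout).
    trans:  dict state -> dict next_state -> transition log-probability.
    start:  dict state -> initial log-probability.
    Returns dict state -> list of (log_prob, state_history), best first.
    """
    # One initial token per start state.
    tokens = {s: [(lp + frames[0][s], (s,))] for s, lp in start.items()}
    for frame in frames[1:]:
        incoming = {}
        for state, state_tokens in tokens.items():
            for log_prob, history in state_tokens:
                # A copy of each token is passed to every connected state.
                for nxt, t_lp in trans.get(state, {}).items():
                    new_token = (log_prob + t_lp + frame[nxt], history + (nxt,))
                    incoming.setdefault(nxt, []).append(new_token)
        # k = 1 keeps only the best token (plain Viterbi);
        # k > 1 retains alternative hypotheses, as needed for lattices.
        tokens = {s: heapq.nlargest(k, toks) for s, toks in incoming.items()}
    return tokens
```

With k = 1 each state ends with a single best path; raising k
keeps more competing histories per state, which is the source of
both the deeper lattice and the extra decoding cost.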

Many other parameters of the lattice generation process can also
significantly impact the system's balance of performance and
speed. Two of the most relevant control parameters are the
insertion penalty and the language model scale. We found
empirically that the parameter combination that maximizes phone
accuracy does not necessarily maximize search performance. In
fact, the parameters that maximize retrieval accuracy (measured
as mean average precision) tend to produce a phone lattice with
a moderately higher level of insertion errors. For this work, we
biased the results against our new approach and chose parameters
which roughly maximized mean average precision on our baseline
search system.

3.  LATTICE  INDEXING 

After mapping an audio file to its phonetic lattice
representation, we must further transform it to facilitate
efficient search. Specifically, we wish to extract a set of
features which can be stored and retrieved quickly (e.g., in an
inverted index) and which succinctly represent the phonetic
information observed in the audio.

Given a lattice $L$ containing many paths $\ell$ (i.e., $\ell \in L$),
the expected number of occurrences of phone n-gram $X$ over all
paths is

$$E_{P_L}[C(X)] = \sum_{\ell \in L} P_L(\ell)\, C_\ell(X).$$

Here, $C_\ell(X)$ denotes the number of times phone n-gram $X$
occurs in lattice path $\ell$. The posterior distribution
$P_L(\ell)$ is defined as

$$P_L(\ell) = \frac{\exp\{\sum_{a \in \ell} S(a)\}}{\sum_{\ell' \in L} \exp\{\sum_{\beta \in \ell'} S(\beta)\}},$$

where $\exp\{\cdot\}$ denotes exponentiation, as we assume the
score $S(a)$ for an arc $a$ on the path is a log probability
(perhaps simply the sum of the acoustic and language model log
probabilities). In practice, these values may be efficiently
computed using a variant of the forward-backward algorithm. This
functionality is currently supported by the SRI language modeling
toolkit [8].
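
As a concrete reading of the two definitions above, the following
Python sketch computes expected phone n-gram counts on a toy
lattice by enumerating every path explicitly. This brute-force
enumeration is only for illustration (a forward-backward pass
avoids it on real lattices), and the arc format
(source, destination, phone, log_score) and the function names
are assumptions, not the toolkit's interface.

```python
import math
from collections import defaultdict


def enumerate_paths(arcs, start, end):
    """Yield (phones, total_log_score) for every start-to-end path of a DAG."""
    outgoing = defaultdict(list)
    for src, dst, phone, score in arcs:
        outgoing[src].append((dst, phone, score))
    stack = [(start, (), 0.0)]
    while stack:
        node, phones, score = stack.pop()
        if node == end:
            yield phones, score
            continue
        for dst, phone, arc_score in outgoing[node]:
            stack.append((dst, phones + (phone,), score + arc_score))


def expected_ngram_counts(arcs, start, end, max_n):
    """Expected count of each phone n-gram (n <= max_n) under the path posterior."""
    paths = list(enumerate_paths(arcs, start, end))
    # Normalizer: log of the summed exponentiated path scores.
    log_z = math.log(sum(math.exp(score) for _, score in paths))
    counts = defaultdict(float)
    for phones, score in paths:
        posterior = math.exp(score - log_z)  # posterior of this path
        for n in range(1, max_n + 1):
            for i in range(len(phones) - n + 1):
                # Each occurrence on the path adds the path posterior.
                counts[phones[i:i + n]] += posterior
    return dict(counts)
```

For a lattice with two competing middle arcs, "u" with
probability 0.75 and "o" with probability 0.25, the n-gram
(g, u) receives an expected count of 0.75 and (g, o, d) a count
of 0.25.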

For each lattice in the corpus, we compute the expected phone
n-gram counts for $n \le N$. Phone n-grams having expected count
less than $\tau$ are discarded. These n-grams and their
associated counts are then indexed using a straightforward
inverted index. While larger values of $\tau$ decrease the total
index size, we found the default choice of
$\tau = 1 \times 10^{-4}$ to produce manageable indices and
excellent results. A preliminary study did show mean average
precision decreasing monotonically with increasing $\tau$. We
fix $N = 5$.
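
A minimal sketch of this indexing step, assuming the expected
counts are already available as one dictionary per lattice. The
names build_inverted_index and tau (the pruning threshold) and
the nested-dictionary layout are illustrative choices, not a
description of the authors' index format.

```python
from collections import defaultdict


def build_inverted_index(lattice_counts, tau=1e-4):
    """Map each phone n-gram to the lattices containing it.

    lattice_counts: dict lattice_id -> dict ngram (tuple of phones)
                    -> expected count.
    Returns dict ngram -> dict lattice_id -> expected count.
    """
    index = defaultdict(dict)
    for lattice_id, counts in lattice_counts.items():
        for ngram, count in counts.items():
            # n-grams with expected count below the threshold are discarded.
            if count >= tau:
                index[ngram][lattice_id] = count
    return dict(index)
```

Raising tau drops more low-probability entries and so shrinks
the index, at the price of losing weak evidence for a term.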

4.  SEARCHING 

A rule-based transliterator (RBT) is first used to map query
words into their phonetic components [6]. This mapping may use
both context-sensitive rules and, when available, pronunciation
dictionaries.

After mapping the query into its phone sequence, we extract a
set $Q$ of phone subsequences with length $n$,
$N - \Delta \le n \le N$. The integer $\Delta$ simply
parameterizes the smallest indexing unit used for the search.
Note that if the full query's phone sequence is shorter than
$N$, the length of the sequence is used as the largest unit for
search. Naturally, indexing sequences are also constrained to
have a positive length. For example, if we are searching for
goodness with $N = 5$ and $\Delta = 1$, we first apply the RBT

$$\text{goodness} \xrightarrow{\text{RBT}} [\text{gudnis}],$$

and then extract the phone subsequence set

$$Q = \{\text{g u d n},\ \text{u d n i},\ \text{d n i s},\ \text{g u d n i},\ \text{u d n i s}\}.$$

If instead we are searching for fun (with the same $N$ and
$\Delta$), we extract the subsequence set

$$Q = \{\text{f a},\ \text{a n},\ \text{f a n}\}.$$

We roughly expect a larger value of $\Delta$ to improve retrieval
when phone recognition is very poor (i.e., when our query phone
subsequences will not have accurately indexed counts for
$n = N$). On the other hand, if $\Delta$ is too large, the very
short phone subsequences utilized will be only poor
discriminators for the underlying terms (e.g., lattices not
containing goodness may nevertheless include the phone 1-gram u).
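
The extraction of the subsequence set can be sketched in a few
lines of Python. The function name query_subsequences and the
argument name delta (the parameter that sets the shortest unit)
are illustrative; the phone symbols are those of the two
examples above.

```python
def query_subsequences(phones, N=5, delta=1):
    """Return the phone subsequences used to search for a query.

    The longest unit is N, or the whole query if it is shorter than N;
    the shortest unit is delta phones shorter, but never less than 1.
    """
    longest = min(N, len(phones))
    shortest = max(1, longest - delta)
    return [tuple(phones[i:i + n])
            for n in range(shortest, longest + 1)
            for i in range(len(phones) - n + 1)]
```

Called on the six phones of goodness this yields three 4-grams
and two 5-grams; called on the three phones of fun it yields two
2-grams and one 3-gram, matching the sets listed above.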

To compute the score for a query and lattice $L$, we sum the
logarithms of the posterior expected n-gram counts associated
with each element of $Q$,

$$\mathrm{score}(\mathrm{query}, L) = \sum_{q \in Q} \log\{E_{P_L}[C(q)]\}. \tag{1}$$

The logarithm can be thought of as a damping function which, due
to its singularity at $\log\{0\}$, acts to aggressively penalize
lattices having a near-zero count for some phone subsequence. We
smooth the counts by giving absent subsequences a very small
count $\epsilon \ll 1$, which parameterizes the penalty for a
missing n-gram. We found our results to be rather insensitive to
the choice of $\epsilon$, which we set to
$\epsilon = 1 \times 10^{-15}$. Note that because $Q$ contains
more subsequences of shorter lengths, and because every short
subsequence is itself a portion of a longer sequence, lattices
are most severely penalized if they do not contain the short
elements of $Q$. This conforms to our intuition that lattices
lacking even the shortest phone subsequences in $Q$ are unlikely
to correspond to our query term.
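
The scoring rule with its smoothing constant can be written
directly. In this sketch the function name lattice_score and the
argument name epsilon are illustrative, and the counts are
assumed to be held in a plain dictionary for one lattice.

```python
import math


def lattice_score(query_ngrams, expected_counts, epsilon=1e-15):
    """Sum of log expected counts over the query's phone subsequences.

    query_ngrams:    iterable of phone tuples extracted from the query.
    expected_counts: dict phone tuple -> expected count in one lattice.
    Subsequences absent from the lattice receive the small count epsilon.
    """
    return sum(math.log(expected_counts.get(q, epsilon)) for q in query_ngrams)
```

Each missing subsequence contributes log(1e-15), roughly -34.5,
so a lattice lacking even one element of the query set falls far
below lattices that contain all of them.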

One advantage of this search heuristic is that Equation 1 may be
computed very efficiently. Queries having a phone sequence of
length $m$ require only
$|Q| = \sum_{n=N-\Delta}^{N} (m - n + 1)$ lookups in the
inverted index, a marginal cost in comparison to the fastest
known methods for the scanning of lattices. At the same time,
because score(query, L) is proportional to the probability of a
term occurring in the audio, it naturally provides a suitable
ranking function for the lattices [9].
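
To make the lookup cost concrete, the sketch below ranks
lattices with exactly one postings lookup per query subsequence.
It assumes a hypothetical index layout (phone tuple -> lattice
id -> expected count); the function name search and the argument
names are illustrative only.

```python
import math


def search(index, query_ngrams, lattice_ids, epsilon=1e-15):
    """Rank lattices for a query using one index lookup per subsequence.

    index:        dict phone tuple -> dict lattice_id -> expected count.
    query_ngrams: list of phone tuples extracted from the query.
    lattice_ids:  all lattice ids in the collection.
    Returns a list of (lattice_id, score), best first.
    """
    log_eps = math.log(epsilon)
    # Start every lattice at the score it gets if all subsequences are absent.
    scores = {lid: len(query_ngrams) * log_eps for lid in lattice_ids}
    for q in query_ngrams:
        # A single lookup retrieves every lattice that contains q.
        for lid, count in index.get(q, {}).items():
            scores[lid] += math.log(count) - log_eps
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)
```

For a 6-phone query with the longest unit 5 and the shortest
unit 4, the loop performs 3 + 2 = 5 lookups, independent of the
size of any individual lattice.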

5.  EXPERIMENTS 

To evaluate the search performance of this approach, test data
in the form of audio and transcripts was assembled in the
following languages: English, Spanish, Mandarin, Levantine, and
Persian. Each test set was drawn from a corpus of conversational
telephone speech and was excluded from the data used to train
the associated phonetic recognizer. All data used for training
and testing are publicly available and were obtained through the
Linguistic Data Consortium¹. Table 1 details the source and size
of each of the test sets used in the experiments presented here.
A list of query words was generated from the actual transcripts
for the evaluation audio in each language. With the exception of
Mandarin, all words containing 3 or more characters were
included in the query list. For Mandarin, as single characters
correspond to whole words, all words were included. We have
maintained the same evaluation sets from [10] to establish a
strong baseline for our measurements.

As in [10], we report mean average precision (MAP). Precision
$p$ is simply the proportion of documents retrieved (at a point
in a ranked list) which are relevant. To compute MAP, the
precisions at each relevant document in a query's ranked list
are averaged. The mean of these averages is then computed over
the set of all queries. For an average query's ranked list, MAP
gives the expected precision at a relevant document, and so is a
measure of the utility of a search system for a user. Formally,
if $p(r)$ denotes the precision at cut-off rank $r$ for a system
returning $D$ documents with $R$ total relevant documents, and
$\mathrm{rel}(r)$ is a binary function indicating the relevance
of the document at rank $r$, then the average precision
$\bar{p}$ for a query is

$$\bar{p} = \frac{\sum_{r=1}^{D} p(r)\,\mathrm{rel}(r)}{R}.$$

The MAP is then simply the average of $\bar{p}$ over all queries.
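
The metric can be computed as follows. This is a generic sketch
of average precision and MAP as defined above, not the authors'
evaluation script; the function names and the input layout (a
list of relevance flags per ranked list) are assumptions.

```python
def average_precision(ranked_relevance, total_relevant):
    """Average precision for one query.

    ranked_relevance: list of booleans, one per returned document, in rank
                      order (True means the document is relevant).
    total_relevant:   number of relevant documents for the query.
    """
    hits = 0
    precision_sum = 0.0
    for rank, is_relevant in enumerate(ranked_relevance, start=1):
        if is_relevant:
            hits += 1
            precision_sum += hits / rank  # precision at this cut-off rank
    return precision_sum / total_relevant if total_relevant else 0.0


def mean_average_precision(queries):
    """queries: list of (ranked_relevance, total_relevant) pairs."""
    return sum(average_precision(r, n) for r, n in queries) / len(queries)
```

For a ranked list with relevant documents at ranks 1 and 3 (and
two relevant documents in total), the average precision is
(1/1 + 2/3) / 2, about 0.83.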

For the lattice search results that follow, we used 5 tokens
during the HTK decoding process to produce the lattices, indexed
n-grams of up to length 5, and searched with n-grams of length
3, 4, and 5 using parameter values $N = 5$ and $\Delta = 2$. We
do not claim any optimality in these parameters and, although
they yield encouraging results, they certainly merit further
examination.

¹ http://www.ldc.upenn.edu/

            3 Tokens               5 Tokens
Language    Speed (xRT)    MAP     Speed (xRT)    MAP
English     0.36x          29.6    0.89x          30.5
Spanish     0.36x          24.5    1.22x          25.4

Table 2. Comparison of indexing speed and search performance
(in MAP) using 3-token and 5-token settings for lattice
generation.

The baseline 1-best search approach seeks to optimize the search
of errorful 1-best output using a weighted match with the query
terms. The phones of the query are matched with the phones of
the 1-best output by a dynamic programming minimum edit distance
calculation that uses weights for phone substitutions,
insertions, and deletions from a set of language-specific
confusion matrices. The confusion matrix for a language seeks to
represent a mapping from the phonetic space of reference
transcripts (transliterated by the same mechanism used for
queries) to the phonetic space of the recognizer output. This
mapping encapsulates recognition error, transcription error, and
pronunciation variation not captured by the query transliterator.
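
The core of such a baseline is a weighted edit distance. The
sketch below shows the standard dynamic program with pluggable
cost functions; how the baseline derives its weights from the
confusion matrices, and how it aligns a short query against a
long utterance, are not specified here, so the cost functions
are hypothetical placeholders.

```python
def weighted_edit_distance(query, hyp, sub_cost, ins_cost, del_cost):
    """Minimum weighted edit distance between two phone sequences.

    sub_cost(q, h): cost of recognizing query phone q as h (0 if they match).
    ins_cost(h):    cost of an extra phone h in the recognizer output.
    del_cost(q):    cost of a query phone q missing from the output.
    """
    rows, cols = len(query) + 1, len(hyp) + 1
    d = [[0.0] * cols for _ in range(rows)]
    for i in range(1, rows):
        d[i][0] = d[i - 1][0] + del_cost(query[i - 1])
    for j in range(1, cols):
        d[0][j] = d[0][j - 1] + ins_cost(hyp[j - 1])
    for i in range(1, rows):
        for j in range(1, cols):
            d[i][j] = min(
                d[i - 1][j - 1] + sub_cost(query[i - 1], hyp[j - 1]),
                d[i - 1][j] + del_cost(query[i - 1]),
                d[i][j - 1] + ins_cost(hyp[j - 1]),
            )
    return d[-1][-1]
```

Giving frequently confused phone pairs a substitution cost below
1 lets a query still match recognizer output that contains a
typical recognition error.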

5.1.  Results 

Our experiments with the proposed indexing and search strategy
demonstrate a significant improvement in MAP performance over
our baseline 1-best results. For comparison, Table 3 includes
both the results from [10] and the results of a 1-best search
using an updated version of the phonetic recognizer used for the
lattice search. The improvements in the most recent phone
recognizer are primarily due to additional training data and
various algorithmic refinements including: the RBT, the forced
alignment procedure, language model training, and parameter
optimization (as mentioned in Section 2).

Table 3 shows that the updated 1-best recognizer produces
significant improvements in terms of both phonetic accuracy and
mean average precision. However, substantial additional gains
are achieved by employing the proposed lattice-based approach.
Improvements over the newest 1-best technique were measured for
each of the five languages tested. The relative improvement in
MAP ranged from 15% for Mandarin Chinese to 106% for Persian
Farsi.
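
The quoted range can be reproduced from the MAP columns of
Table 3. The short Python check below uses the updated 1-best
and lattice MAP values as listed in that table; the variable and
function names are illustrative.

```python
# MAP values from Table 3: language -> (updated 1-best MAP, lattice MAP).
TABLE3_MAP = {
    "English": (23.2, 30.5),
    "Spanish": (21.5, 25.4),
    "Mandarin": (9.8, 11.3),
    "Levantine": (8.1, 12.3),
    "Persian": (5.3, 10.9),
}


def relative_improvement(baseline, new):
    """Relative gain of `new` over `baseline`, in percent."""
    return 100.0 * (new - baseline) / baseline


GAINS = {lang: relative_improvement(base, lattice)
         for lang, (base, lattice) in TABLE3_MAP.items()}
```

Rounded to whole percentages, the smallest gain is for Mandarin
and the largest for Persian, with the other three languages in
between.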

As mentioned previously, 5 tokens were used for the lattice
generation process. While this resulted in our largest observed
gains in retrieval performance, the computational cost of
maintaining 5 tokens per state during Viterbi decoding is
substantial. Table 2, however, shows that the computational cost
of lattice generation can be significantly mitigated by reducing
the number of tokens to 3. For English and Spanish, reducing the
number of tokens increased decoding speed by factors of roughly
2.5 and 3.4, respectively (from 0.89x and 1.22x to 0.36x real
time), at only a minor loss in MAP. All experiments were
conducted on a Linux machine with a 2.0 GHz AMD processor.

Language    Source      Duration (minutes)  Utterances  Unique Query Words
English     CallHome    148.08              2175        2204
Spanish     CallHome    225.8               3908        3817
Mandarin    CallHome    83.2                2888        1845
Levantine   EARS        649.8               6648        9890
Persian     CallFriend  239.7               5903        4777

Table 1. Source and collection statistics for search evaluation
data sets.

            1-best (2005)    1-best (2006)    Lattice
Language    Acc.    MAP      Acc.    MAP      MAP
English     38.1    22.5     43.4    23.2     30.5
Spanish     48.4    15       50.9    21.5     25.4
Mandarin    37.5    5.7      42.2    9.8      11.3
Levantine   -       6.9      37.2    8.1      12.3
Persian     30.5    4.0      43.5    5.3      10.9

Table 3. Phone accuracy and search effectiveness for five
languages using the original 1-best recognized phonetic output
(2005), the 1-best output from the updated phone recognizer
(2006), and the proposed lattice-based system.

One strength of our baseline system is that it accounts for the
confusability of phones using estimates measured directly on
held-out data. The lattice method, on the other hand, does not
incorporate any notion of nearness in mismatched phones, so we
might expect it to perform worse on languages with low
recognition accuracy. Table 3 demonstrates that this is not the
case. The lattice indexing approach is surprisingly robust in
the presence of recognition error, presumably because a
sufficient number of alternative phone hypotheses are
represented by the lattices. As Table 2 indicates, this remains
true even with many fewer paths in the lattice (i.e., for fewer
tokens).

6.  CONCLUSION 

We have proposed a lattice indexing and search procedure for
spoken utterance retrieval of conversational speech. The method
efficiently indexes and searches phonetic lattices, showing
significant improvements in retrieval performance over our
baseline, while maintaining a system that is faster than
real-time. By demonstrating significant performance increases
across five languages, we have shown the method to be
surprisingly robust both to variation in human language and to
the error characteristics of phonetic recognition systems.

While our purpose was to maximize the utility of the retrieved
lattices as quickly as possible, future work might extend this
approach by reranking the returned list with a more costly
lattice scanning system [3]. We also hope to explore larger
collection sizes. For very large collections, an index may grow
to become unmanageable. To constrain its size, we may consider
the careful selection of indexing features by observable
phonotactic constraints. We hope we have provided a sound basis
for this work.